Computational Methods

• Methods of Inference 

- Hand-crafted rules 

- Statistical Methods: N-Grams

- Statistical Methods: Decision Trees 

- Statistical Methods: Decision Lists 

- Mixed Methods 

• Methods of Implementation 

- Weighted Finite-State Acceptors/Transducers
Hand-Crafted Rulesets



• Context-free syntactic rewrite rules: S → NP VP

• Phonological rewrite rules: C → [−voiced] / ___ #

• Tree-to-tree transduction rules: 

;; input pattern
((E_ADJ (= CAT E_ADJ)
        (MORPH-0 (= LEX "U_S")))
 ;; output pattern
 (E_ADJ
   (MORPH-0
     (& (= LEX "american")
        (= SRC "U_S")))
   (0 (& (= cat s_gen)
         (= sgen (e_adj sgen))))
   (0 (& (= cat s_num)
         (= snum (e_adj snum))))))
Hand-Crafted Rulesets: a Speech Example

Some rules for foreign name pronunciation in English

German

sch → /ʃ/
ei → /aɪ/

French

eau → /o/
Cons → ∅ / ___ #

Japanese 

(J ^ & I a # 
Hand-Crafted Rules: Advantages/Limitations

• Advantages: 

- Easy to encode linguistic knowledge directly and precisely 

- Resulting rules are (usually) readily comprehensible to a linguist 

• Limitations: 

- Rulesets are often large and complicated: 

* Construction is costly 

* Rule interactions are hard to manage. However, rule 
development environments — e.g. TWOL (Dalrymple et al., 
1987) — are useful here 

- Nonprobabilistic: 

* Systems usually output all possible analyses without associated 
weights 

* Bad for many speech applications 
N-Grams: Basics

'Chain Rule' and Joint/Conditional Probabilities:

$P[x_1 x_2 \ldots x_n] = P[x_n \mid x_1 \ldots x_{n-1}]\, P[x_{n-1} \mid x_1 \ldots x_{n-2}] \cdots P[x_2 \mid x_1]\, P[x_1]$

where, e.g.,

$P[x_n \mid x_1 \ldots x_{n-1}] = \dfrac{P[x_1 \ldots x_n]}{P[x_1 \ldots x_{n-1}]}$

(First-Order) Markov assumption:

$P[x_k \mid x_1 \ldots x_{k-1}] = P[x_k \mid x_{k-1}] = \dfrac{P[x_{k-1} x_k]}{P[x_{k-1}]}$

nth-Order Markov assumption:

$P[x_k \mid x_1 \ldots x_{k-1}] = P[x_k \mid x_{k-n} \ldots x_{k-1}]$
N-Grams: Maximum Likelihood Estimation

Let N be the total number of n-grams observed in a corpus and $c(x_1 \ldots x_n)$ be the number of times the n-gram $x_1 \ldots x_n$ occurred. Then

$P[x_1 \ldots x_n] = \dfrac{c(x_1 \ldots x_n)}{N}$

is the maximum likelihood estimate of that n-gram probability.
For conditional probabilities,

$P[x_n \mid x_1 \ldots x_{n-1}] = \dfrac{c(x_1 \ldots x_n)}{c(x_1 \ldots x_{n-1})}$

is the maximum likelihood estimate.

With this method, an n-gram that does not occur in the corpus is assigned 
zero probability. 
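
A minimal sketch of the conditional estimate for the bigram case (n = 2), in Python. The function name and the toy token list are illustrative assumptions, not part of the original material.

```python
from collections import Counter

def mle_bigram_probs(tokens):
    """Return P[w | h] = c(h w) / c(h) for every observed bigram (h, w)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    # Count a token as a history only where it starts a bigram,
    # so the conditionals for each history sum to 1.
    histories = Counter(tokens[:-1])
    return {(h, w): c / histories[h] for (h, w), c in bigrams.items()}

# Toy corpus (hypothetical)
tokens = "the table is ready the motion is ready".split()
probs = mle_bigram_probs(tokens)
unseen = probs.get(("the", "is"), 0.0)  # unseen bigrams receive zero probability
```

Here `probs[("the", "table")]` is 1/2 because "the" is followed once by "table" and once by "motion".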
N-Grams: Good-Turing-Katz Estimation

Let $n_r$ be the number of n-grams that occurred r times. Then

$P[x_1 \ldots x_n] = \dfrac{c^*(x_1 \ldots x_n)}{N}$

is the Good-Turing estimate of that n-gram probability, where

$c^*(x) = (c(x) + 1)\, \dfrac{n_{c(x)+1}}{n_{c(x)}}$.

For conditional probabilities,

$P[x_n \mid x_1 \ldots x_{n-1}] = \dfrac{c^*(x_1 \ldots x_n)}{c(x_1 \ldots x_{n-1})}, \quad c(x_1 \ldots x_n) > 0$

is Katz's extension of the Good-Turing estimate.

With this method, an n-gram that does not occur in the corpus is assigned
the backoff probability $P[x_n \mid x_1 \ldots x_{n-1}] = \alpha\, P[x_n \mid x_2 \ldots x_{n-1}]$, where
$\alpha$ is a normalizing constant.
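
A minimal sketch of the adjusted count c* = (c + 1) n_{c+1} / n_c, in Python. The toy counts are hypothetical, and the fallback for counts with no n_{c+1} (keep the raw count) is an assumption made here to keep the example runnable; practical systems typically smooth the n_r values instead.

```python
from collections import Counter

def good_turing_counts(counts):
    """Map each raw count r to r* = (r + 1) * n_{r+1} / n_r."""
    n = Counter(counts.values())  # n[r] = number of n-grams seen exactly r times
    adjusted = {}
    for r in sorted(n):
        if n.get(r + 1, 0) > 0:
            adjusted[r] = (r + 1) * n[r + 1] / n[r]
        else:
            adjusted[r] = float(r)  # assumption: no n_{r+1} available, keep raw count
    return adjusted

# Hypothetical bigram counts: n_1 = 3, n_2 = 2, n_3 = 1
counts = {"a b": 1, "b c": 1, "c d": 1, "d e": 2, "e f": 2, "f g": 3}
adjusted = good_turing_counts(counts)
```

With these counts, singletons are discounted from 1 to 2 · 2/3 ≈ 1.33, which frees probability mass for unseen n-grams.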
N-Grams: Advantages/Limitations

• Advantages:

- Captures local, conditional probabilistic information well with 
adequate data. 

- Simple to use/understand. 

- Efficient implementation. 



• Limitations: 

- Fails to capture wider-context information. 

- Only limited degree of context generalization (from the back-off), 
so technique is weak when data is sparse (cf., manual/automatic 
context clustering). 
Stochastic Part-of-Speech Assignment

• Words may have multiple grammatical parts of speech:
He/PPS will/MD table/VB the/AT motion/NN
The/AT table/NN is/BEZ ready/JJ

Can/MD they/PPSS can/VB cans/NNS

• Solution is to use an n-gram model trained on a large corpus of
tagged text (Church, 1988; DeRose, 1988), or on an untagged corpus
using a dictionary and a reestimation procedure (Kupiec, 1992).

$\operatorname{argmax} \prod_{i=2}^{n-1} p(\mathrm{word}_i \mid \mathrm{part}_i)\; p(\mathrm{part}_i \mid \mathrm{part}_{i-1}\, \mathrm{part}_{i-2})$
Language-of-Origin Identification for Names

Name             Language
Vitale           Italian
Fujisaki         Japanese
Rodríguez        Spanish
Blaustein        German
Andruszkiewicz   Polish
Perrault         French

• Solution: letter trigrams can be used to model the graphotactics of
each language (Church, 1986; Vitale, 1991).
Language-of-Origin Identification for Names

Trigram probabilities for Vitale (Vitale, 1991, p. 265)

Trigram   p(L1|T) (Italian)   p(L2|T)   ...   p(Ln|T)
#vi       .4659               .0679     ...   .2093
vit       .4145               .0263     ...   .0000
ita       .7851               .0490     ...   .0564
tal       .4422               .1013     ...   .2384
ale       .2602               .0867     ...   .2892
le#       .3181               .1884     ...   .0688
mean      .4477               .0866     ...   .1437
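
The "mean" row can be reproduced with a few lines of Python. This is a sketch only: the helper names are illustrative, and the table is reduced to the Italian column shown above; the treatment of trigrams missing from the table (probability 0) is an assumption.

```python
def letter_trigrams(name):
    """Letter trigrams of a name, with '#' marking the word boundaries."""
    padded = "#" + name.lower() + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def mean_trigram_prob(name, table):
    """Average of p(L|T) over the name's trigrams (missing trigrams count as 0)."""
    grams = letter_trigrams(name)
    return sum(table.get(g, 0.0) for g in grams) / len(grams)

# Italian column from the table for "Vitale"
italian = {"#vi": .4659, "vit": .4145, "ita": .7851,
           "tal": .4422, "ale": .2602, "le#": .3181}
score = mean_trigram_prob("Vitale", italian)
```

Running the same function with each language's table and taking the language with the highest mean gives the language-of-origin guess.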
Foreign Name Detection in Chinese

[Chinese text; characters lost in extraction]
yi3 bo2-gen1 mi4-de2-sa4-si1 he2 meng4-mo4-si1 san1 jun4 wei2 li4
take Bergen Middlesex and Monmouth three county for example
'taking the three counties of Bergen, Middlesex and Monmouth as an example'

[Chinese text; characters lost in extraction]
zai4 niu3-wa3-ke4 tu2-shu1-guan3
in Newark library
'in Newark library'

• Only a couple of hundred characters are at all common in
transliterating foreign names, so one can build a simple n-gram model
of these characters (see Sproat et al., 1994).



M. Riley & R. Sproat 



Text Analysis Tools in SLP, June 27, 1994 



N-Gram Methods 



14 



Decision Trees: Overview 



• Description/Use: Simple structure: a binary tree of decisions whose
terminal nodes determine the prediction (cf. the "Game of Twenty
Questions"). If the dependent variable is categorical (e.g., red,
yellow, green), the tree is called a "classification tree"; if
continuous, a "regression tree".

• Creation/Estimation: Creating a binary decision tree for
classification or regression involves three steps (Breiman et al.):

1. Splitting Rules: Which split to take at a node? 

2. Stopping Rules: When to declare a node terminal? 

3. Node Assignment: Which class/value to assign to a terminal node? 
1. Decision Tree Splitting Rules

Which split to take at a node?

• Candidate splits considered.

- Binary cuts: For continuous $-\infty < x < \infty$, consider splits of
form:

$x < k$ vs. $x \ge k$, $\forall k$.

- Binary partitions: For categorical $x \in \{1, 2, \ldots, n\} = X$,
consider splits of form:

$x \in A$ vs. $x \in X - A$, $\forall A \subset X$.
1. Decision Tree Splitting Rules - Continued

• Choosing best candidate split. 

- Method 1 : Choose k (continuous) or A (categorical) that 
minimizes estimated classification (regression) error after split. 

- Method 2 (for classification): Choose k or A that minimizes 
estimated entropy after that split. 
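
Method 2 for a continuous variable can be sketched as follows in Python. The candidate thresholds (midpoints between neighboring observed values) and the toy data are assumptions made for illustration; the original does not say how candidate k values are enumerated.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_binary_cut(xs, ys):
    """Return (k, h): the cut x < k vs. x >= k with minimum weighted entropy h."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        k = (lo + hi) / 2  # candidate threshold between neighboring values
        left = [y for x, y in zip(xs, ys) if x < k]
        right = [y for x, y in zip(xs, ys) if x >= k]
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / len(xs)
        if best is None or h < best[1]:
            best = (k, h)
    return best

# Hypothetical data: the classes separate cleanly between 3 and 10
xs = [1, 2, 3, 10, 11, 12]
ys = ["no", "no", "no", "yes", "yes", "yes"]
k, h = best_binary_cut(xs, ys)
```

For a categorical variable the same scoring applies, with subsets A of X in place of thresholds k.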



[Figure: two candidate splits, SPLIT #1 and SPLIT #2]
2. Decision Tree Stopping Rules

When to declare a node terminal?

• Strategy (Cost-Complexity pruning):

1. Grow over-large tree.

2. Form sequence of subtrees, $T_\alpha$, ranging from full tree to
just the root node.

3. Estimate "honest" error rate for each subtree.

4. Choose tree size with minimum "honest" error rate.

• To form sequence of subtrees, vary $\alpha$ from 0 (for full tree) to $\infty$ (for
just root node) in:

$\min_T \left[ R(T) + \alpha |T| \right]$.

• To estimate "honest" error rate, test on data different from training
data, e.g., grow tree on 9/10 of available data and test on 1/10 of data,
repeating 10 times and averaging (cross-validation).
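
A small Python sketch of how the penalty α selects a subtree, assuming R(T) is the tree's error on the training data and |T| its number of terminal nodes. The three candidate subtrees and their numbers are hypothetical.

```python
def best_subtree(subtrees, alpha):
    """Pick the subtree minimizing R(T) + alpha * |T|.

    subtrees: list of (name, error R(T), number of terminal nodes |T|).
    """
    return min(subtrees, key=lambda t: t[1] + alpha * t[2])

# Hypothetical pruning sequence, from the full tree down to the root
sequence = [("full", 0.02, 40), ("medium", 0.05, 10), ("root", 0.30, 1)]

chosen_small_alpha = best_subtree(sequence, 0.0)[0]
chosen_mid_alpha = best_subtree(sequence, 0.005)[0]
chosen_large_alpha = best_subtree(sequence, 1.0)[0]
```

As α grows, the size penalty dominates and progressively smaller subtrees win, which traces out the pruning sequence.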
End of Declarative Sentence Prediction: Pruning Sequence

[Figure: plot over the pruning sequence against # of terminal nodes (axis ticks 20 to 100); + = raw, second marker = cross-validated]
3. Decision Tree Node Assignment



Which class/value to assign to a terminal node? 

• Plurality vote: Choose most frequent class at that node for 
classification; choose mean value for regression. 






End-of-Declarative-Sentence Prediction: Features 



• Prob[word with "." occurs at end of sentence]

• Prob[word after "." occurs at beginning of sentence]

• Length of word with "."

• Length of word after "."

• Case of word with ".": Upper, Lower, Cap, Numbers

• Case of word after ".": Upper, Lower, Cap, Numbers

• Punctuation after "." (if any)

• Abbreviation class of word with ".": e.g., month name,
unit-of-measure, title, address name, etc.
End of Declarative Sentence?

[Figure: decision tree; surviving leaf counts include 5137/5283 and 133/152]
Phoneme-to-Phone Alignment

PHONEME   PHONE   WORD
p         p       purpose
er        er
p         pcl
          p
ax        ix
s         s
ae        ax      and
n         n
d
r         r       respect
ih        ix
s         s
p         pcl
          p
eh        eh
k         kcl
t         t
Phoneme-to-Phone Realization: Features



• Phonemic Context: 

- Phoneme to predict 

- Three phonemes to left 

- Three phonemes to right 

• Stress (0, 1, 2) 

• Lexical Position: 

- Phoneme count from start of word 

- Phoneme count from end of word 
Phoneme-to-Phone Realization: Prediction Example

Tree splits for /t/ in "your pretty red":

PHONE    COUNT    SPLIT
ix       182499
n        87283    cm0: vstp,ustp,vfri,ufri,vaff,uaff,nas
kcl+k    38942    cm0: vstp,ustp,vaff,uaff
tcl+t    21852    cp0: alv,pal
tcl+t    11928    cm0: ustp
tcl+t    5918     vm1: mono,rvow,wdi,ydi
dx       3639     cm-1: ustp,rho,n/a
dx       2454     rstr: n/a,no

(Riley, 1991).
Phoneme-to-Phone Realization: Network Example

Phonetic network for "Don had your pretty...":

PHONEME   PHONE1        PHONE2       PHONE3   CONTEXT
d         0.91 d
aa        0.92 aa
n         0.98 n
hh        0.74 hh       0.15 hv
ae        0.73 ae       0.19 eh
d         0.51 dcl jh   0.37 dcl d
y         0.90 y                              (if d → dcl d)
          0.84 -        0.16 y                (if d → dcl jh)
uw        0.48 axr      0.29 er
r         0.99 -
p         0.99 pcl p
r         0.99 r
ih        0.86 ih
t         0.73 dx       0.11 tcl t
iy        0.90 iy
Decision Trees: Advantages/Limitations

• Advantages: 

- Handles continuous and categorical variables naturally. 

- Cross-validation gives results that generalize to new data. 

- Efficient algorithms (approx. n log n, n = no. of obs.)

- Small/medium-sized trees are easy to interpret/modify.

• Limitations: 

- Recursive splitting can quickly cause "data starving" and 
replication of structure. 

- Categorical variables with large number of alternatives 
computationally unwieldy. 

- Large trees are hard to interpret/modify. 
The Replication Problem (Pagallo and Haussler)

• Smallest decision tree for DNF expression $x_1 x_2 + x_3 x_4 x_5$:
The Replication Problem - Continued

• Equivalent tree using complex features in splits:

• Decision List Representation:

Expression   Value
x1x2         1
x3x4x5       1
True         0









Decision List Creation/Estimation 

The Separate and Conquer Algorithm (Pagallo and Haussler) begins with
a set of examples S, an auxiliary set P (the pot), and an empty decision list
DL.

1. Select primitive feature that minimizes the entropy in S after splitting
on that feature.

2. Retain purer half of split in S and place the rest in P. If there is only
one class label in S, then goto 3, else goto 1.

3. Add complex feature that is the conjunction of the primitive features
found in Steps 1 - 2 to the decision list DL with the class label of the
examples in S.

4. If there is only one class label in P, then add that class label as the
default case to the decision list DL and stop, else S ← P and P ← ∅
and goto 1.
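
The four steps can be sketched in Python as below. This is an illustrative reading of the steps, not the published implementation: examples are assumed to be (boolean feature dict, label) pairs, "purer half" is taken to mean the half with lower entropy, and the guard for data that no feature can split is an addition made here so that the loop always terminates.

```python
import math
from collections import Counter

def entropy(examples):
    """Entropy (bits) of the class labels in a list of (features, label) pairs."""
    n = len(examples)
    counts = Counter(label for _, label in examples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def separate_and_conquer(examples, features):
    """Return a decision list [(conjunction, label), ...]; [] is the default test."""
    decision_list, S = [], list(examples)
    while True:
        pot, conj = [], []
        # Steps 1-2: split S on the entropy-minimizing feature until S is pure.
        while len({label for _, label in S}) > 1:
            best = None
            for f in features:
                yes = [e for e in S if e[0][f]]
                no = [e for e in S if not e[0][f]]
                if yes and no:
                    h = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(S)
                    if best is None or h < best[0]:
                        best = (h, f, yes, no)
            if best is None:
                break  # guard (assumption): no feature splits S, stop refining
            _, f, yes, no = best
            keep_yes = entropy(yes) <= entropy(no)
            S, rest = (yes, no) if keep_yes else (no, yes)
            conj.append((f, keep_yes))
            pot.extend(rest)
        label = Counter(lab for _, lab in S).most_common(1)[0][0]
        if not pot:
            decision_list.append(([], label))  # nothing set aside: S gives the default
            return decision_list
        # Step 3: add the conjunction with the class label of S.
        decision_list.append((conj, label))
        # Step 4: a pure pot supplies the default; otherwise continue with the pot.
        pot_labels = {lab for _, lab in pot}
        if len(pot_labels) == 1:
            decision_list.append(([], pot_labels.pop()))
            return decision_list
        S = pot

def classify(decision_list, x):
    """Return the label of the first entry whose conjunction holds for x."""
    for conj, label in decision_list:
        if all(x[f] == value for f, value in conj):
            return label

# Toy concept (hypothetical): label = x1 AND x2
data = [({"x1": a, "x2": b}, int(a and b)) for a in (False, True) for b in (False, True)]
dl = separate_and_conquer(data, ["x1", "x2"])
```

Each entry pairs a conjunction of (feature, required value) tests with a class label, and the final entry with an empty conjunction is the default case.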
Decision List Creation/Estimation - Continued

• Alternative approaches: 

- Choose complex features from all combinations of k primitive 
features (Rivest) or from particular combinations selected from 
the problem domain (Yarowsky 1992, 1994). 

- Do not partition the data after each decision, but reuse all the data, 
disallowing the previous decisions from recurring. More 
generally, interpolate by using a linear combination of the global 
and residual data (Yarowsky 1992, 1994). 

• Pruning Strategies:

- Evaluating held-out data, iteratively remove any decisions that do 
not improve the performance (Pagallo and Haussler; Yarowsky 
1992, 1994). 

- Remove any redundant decisions subsumed by prior decisions 
(Yarowsky, 1992, 1994). 
Lexical Ambiguity Resolution

• Word sense disambiguation:

She handed down a harsh sentence.      peine
This sentence is ungrammatical.        phrase

• Homograph disambiguation:

He plays bass.                         /beɪs/
This lake contains a lot of bass.      /bæs/

• Diacritic restoration:

appeler l'autre cote de l'atlantique   côté 'side'
Cote d'Azur                            côte 'coast'

(Yarowsky, 1992; Yarowsky 1994; Sproat, Hirschberg & Yarowsky, 1992;
Hearst 1991)
Homograph Disambiguation 1

• N-Grams

Evidence         lɛd   lid   Logprob
lead level/N     219   0     11.10
of lead in       162   0     10.66
the lead in      0     301   10.59
lead poisoning   110   0     10.16
lead role        0     285   10.51
narrow lead      0     70    8.49
lead in          207   898   1.15
Homograph Disambiguation 1

• Predicate-Argument Relationships

Evidence          lɛd   lid   Logprob
follow/V + lead   0     527   11.40
take/V + lead     1     665   7.76

• Wide Context

Evidence          lɛd   lid   Logprob
zinc ↔ lead       235   0     11.20
copper ↔ lead     130   0     10.35

• Other Features (e.g. Capitalization)
Homograph Disambiguation 2

Sort by $\left| \log \dfrac{\Pr(\mathrm{Pron}_1 \mid \mathrm{Collocation}_i)}{\Pr(\mathrm{Pron}_2 \mid \mathrm{Collocation}_i)} \right|$

Decision List for lead

Logprob   Evidence          Pronunciation
11.40     follow/V + lead   → lid
11.20     zinc ↔ lead       → lɛd
11.10     lead level/N      → lɛd
10.66     of lead in        → lɛd
10.59     the lead in       → lid
10.51     lead role         → lid
10.35     copper ↔ lead     → lɛd
10.28     lead time         → lid
10.16     lead poisoning    → lɛd
Homograph Disambiguation 3: Pruning

• Redundancy by subsumption

Evidence       lɛd   lid   Logprob
lead level/N   219   0     11.10
lead levels    167   0     10.66
lead level     52    0     8.93

• Redundancy by association

Evidence            tɛɚ   tɪɚ
tear gas            0     1671
tear ↔ police       0     286
tear ↔ riot         0     78
tear ↔ protesters   0     71
Homograph Disambiguation 4

Choose single best piece of matching evidence.

Decision List for lead

Logprob   Evidence          Pronunciation
11.40     follow/V + lead   → lid
11.20     zinc ↔ lead       → lɛd
11.10     lead level/N      → lɛd
10.66     of lead in        → lɛd
10.59     the lead in       → lid
10.51     lead role         → lid
10.35     copper ↔ lead     → lɛd
10.28     lead time         → lid
10.16     lead poisoning    → lɛd
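
Applying such a list can be sketched in Python as follows. Only five of the entries above are encoded, and the matchers are simplifying assumptions: "wide context" is approximated as "anywhere in the same sentence", and the entries that need part-of-speech tags or predicate-argument analysis are left out. All function names are hypothetical.

```python
def next_word(w):
    """Evidence of the form 'lead <w>'."""
    return lambda toks, i: i + 1 < len(toks) and toks[i + 1] == w

def in_context(w):
    """Wide-context evidence, approximated as: w occurs in the sentence."""
    return lambda toks, i: w in toks

# (logprob, matcher, pronunciation), taken from the decision list for "lead"
RULES = [
    (11.20, in_context("zinc"), "lɛd"),
    (10.51, next_word("role"), "lid"),
    (10.35, in_context("copper"), "lɛd"),
    (10.28, next_word("time"), "lid"),
    (10.16, next_word("poisoning"), "lɛd"),
]

def pronounce_lead(sentence, default=None):
    """Return the pronunciation chosen by the single best matching evidence."""
    toks = sentence.lower().split()
    i = toks.index("lead")
    for _, test, pron in sorted(RULES, key=lambda r: -r[0]):
        if test(toks, i):
            return pron
    return default

first = pronounce_lead("the lead role in a film about copper")
second = pronounce_lead("lead poisoning")
third = pronounce_lead("they lead")
```

In the first sentence both "lead role" and "copper" match, and the higher-ranked "lead role" decides; weaker evidence is never combined with it.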
Homograph Disambiguation: Evaluation

Word       Pron1      Pron2     Sample Size   Prior   Performance
lives      laɪvz      lɪvz      33186         .69     .98
wound      waʊnd      wund      4483          .55     .98
Nice       naɪs       nis       573           .56     .94
Begin      bɪˈgɪn     beɪgɪn    1143          .75     .97
Chi        tʃi        kaɪ       1288          .53     .98
Colon      koʊˈloʊn   ˈkoʊlən   1984          .69     .98
lead (N)   lid        lɛd       12165         .66     .98
tear (N)   tɛɚ        tɪɚ       2271          .88     .97
axes (N)   ˈæksiz     ˈæksɪz    1344          .72     .96
IV         aɪ vi      fɔɹθ      1442          .76     .98
Jan        dʒæn       jan       1327          .90     .98
routed     ɹutɪd      ɹaʊtɪd    589           .60     .94
bass       beɪs       bæs       1865          .57     .99
TOTAL                           63660         .67     .97
Chinese Homograph Disambiguation

行 xing2/hang2

[Table: the Chinese example contexts were lost in extraction; the surviving pronunciation and gloss columns are listed below]

TRUE    PRED    GLOSS
xing2   xing2   PROCEED, DO
xing2   xing2   PROCEED, DO
xing2   xing2   PROCEED, DO
xing2   xing2   DEEDS
xing2   xing2   DEEDS
hang2   hang2   COMPANY
hang2   hang2   COMPANY
hang2   hang2   COMPANY
hang2   hang2   COMPANY
hang2   hang2   MEASURE WORD
hang2   hang2   MEASURE WORD
xing2   xing2   TRAVEL
xing2   xing2   TRAVEL
xing2   xing2   TRAVEL
xing2   xing2   TRAVEL
xing2   hang2   wrong
hang2   xing2   wrong

PERCENT CORRECT: 94
Decision Lists: Advantages/Limitations



• Advantages: 

- Efficient and flexible use of data. 

- Easy to interpret and modify. 

- Handles both wide and narrow context information. 



• Limitations: 

- New area; many aspects not well-studied - e.g., best complex 
feature selection rules, efficient pruning/cross-validation 
techniques, global vs. residual vs. interpolated dataset division. 
Mixed Methods: Trained Hand-crafted Rules

• Probabilistic parsing (e.g. Jelinek et al., 1990; Su et al., 1992; 
Goddeau, 1992) 

• Probabilistic morphological analysis (e.g. Heemskerk, 1993) 

Consider that the Dutch word beneveling 'intoxication' has two 
morphologically possible analyses (the second depends upon regular 
Dutch orthographic spelling changes): 

be_{V/N} + nevel_N + ing_{V\N}     BE- + mist + -ion      'intoxication'
be_{V/N} + neef_N + eling_{V\N}    BE- + nephew + -ling   '??'

Prefixation: B/A · A → B
Suffixation: A · A\B → B
Compounding: A · B → B
Probabilistic Morphological Parsing

[Figure: two rejected parse trees for beneveling: (a) been 'leg' + e + veel 'much' + ing; (b) an analysis containing neef + eling]

(a) is ruled out by (hand-constructed) categorial grammar, while

(b) is ruled out by prohibiting noun morphology from being input to verb morphology






Probabilistic Morphological Parsing 

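$p([_N\, [_V\, [_{V/N}\, \text{be}]\; [_N\, \text{nevel}]]\; [_{V \backslash N}\, \text{ing}]]) =$

$\quad p(\text{word} \to N) \times$

$\quad p(N \to V \;\; V \backslash N) \times$

$\quad p(V \to V/N \;\; N) \times$

$\quad p(V/N \to \text{be}) \times$

$\quad p(N \to \text{nevel}) \times$

$\quad p(V \backslash N \to \text{ing})$

$p([_N\, [_V\, [_{V/N}\, \text{be}]\; [_N\, \text{nevel}]]\; [_{V \backslash N}\, \text{ing}]]) > p([_N\, [_V\, [_{V/N}\, \text{be}]\; [_N\, \text{neef}]]\; [_{V \backslash N}\, \text{eling}]])$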
Probabilistic Morphological Parsing



• Probabilities for production rules were calculated from the CELEX 
database containing 123,000 morphologically annotated Dutch stems. 



• Among 1612 structurally ambiguous words that had a correct analysis 
among the alternatives, probabilistic techniques gave the highest 
score to the correct analysis in 1483 / 1612 = 92% of the cases. 
Weighted Finite-State Methods: Motivation

• Unified representation for information sources: 

- strings/lattices 

- dictionaries 

- decoders/generators 

- language models 

• Uniform algorithms for: 

- combining information sources into generators, decoders, etc. 

- search 

- minimizing representations 

• Modular definition of language processors (cf. lex, yacc)
Weighted Finite-State Methods: Origins

• Probabilistic automata (Paz, Taylor and Booth, ...)

• Algebraic theory of languages and automata (Schützenberger,
Eilenberg, Berstel, Kuich & Salomaa, ...)

• Hidden Markov models (...)

• Theory of shortest path algorithms (Dijkstra, Aho, Hopcroft
& Ullman, ...)
Transduction Cascades

Standard "noisy channel" model: for given observations o, find
message w that maximizes

$P(w, o) = P(o \mid w)\, P(w)$

Multistage cascade:

$P(s_0, s_k) = P(s_k \mid s_0)\, P(s_0)$

"Viterbi" version:

$\tilde{P}(s_0, s_k) = \tilde{P}(s_k \mid s_0) + \tilde{P}(s_0)$

$\tilde{P}(s_k \mid s_0) \approx \min_{s_1, \ldots, s_{k-1}} \sum_{1 \le j \le k} \tilde{P}(s_j \mid s_{j-1})$

where $\tilde{X} = -\log X$.
The Basic Generalizations

• Weighted languages: functions from strings to weights, modeling 
information sources 

• Weighted transductions: functions from pairs of strings (one from 
each level) to weights, modeling mappings between levels of 
representation 

• Rational algebra: make complex languages and transductions from 
simple ones 

• Examples: 

- languages: phone sequence/lattice, language model 

- transductions: pronunciation dictionary, phoneme-to-phone 
realization, grapheme-to-phoneme conversion, text segmenter 
Weights

• Weight semiring: set of weights K with two commutative, associative 
operations: 

- sum: combines the weights of the ways of deriving an object to 
form the overall weight of the object 

- product: combines the weights of subobjects into the weight of 
the combined object; 

- product distributes with respect to sum 

• Examples:

semiring      sum   product   0     1
Viterbi       min   +         +∞    0
probability   +     ×         0     1
boolean       or    and       0     1
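
The three example semirings can be written down directly in Python. This is a sketch under assumptions: the `Semiring` class and `total_weight` helper are names invented here, paths are plain lists of arc weights, and the toy weights are hypothetical.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]   # "sum": combines alternative derivations
    times: Callable[[Any, Any], Any]  # "product": combines weights along one derivation
    zero: Any                         # identity for plus
    one: Any                          # identity for times

VITERBI = Semiring(min, lambda a, b: a + b, float("inf"), 0.0)
PROBABILITY = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)
BOOLEAN = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)

def total_weight(paths, sr):
    """Semiring sum, over all paths, of the semiring product of each path's arc weights."""
    path_weights = (reduce(sr.times, p, sr.one) for p in paths)
    return reduce(sr.plus, path_weights, sr.zero)

prob = total_weight([[0.5, 0.5], [0.25, 1.0]], PROBABILITY)  # 0.25 + 0.25
cost = total_weight([[1.0, 2.0], [0.5, 4.0]], VITERBI)       # min(3.0, 4.5)
accepted = total_weight([[True, False], [True, True]], BOOLEAN)
```

The same `total_weight` code serves all three cases; only the semiring changes, which is the point of the abstraction.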






Weighted Languages and Transductions 

• Generalized information source: weighted language

$L : \Sigma^* \to K$ (behaviors to weights)

• Generalized transduction step: weighted transduction

$S : \Sigma^* \times \Gamma^* \to K$ (inputs and outputs to weights)

• Combining levels (generalized composition):

composition: $(S \circ T)(r, t) = \sum_{s \in \Gamma^*} S(r, s)\, T(s, t)$

application: $(L \circ S)(s) = \sum_{r \in \Sigma^*} L(r)\, S(r, s)$

reverse application: $(S \circ M)(r) = \sum_{s \in \Gamma^*} S(r, s)\, M(s)$

intersection: $(M \circ N)(s) = M(s)\, N(s)$
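
For finite transductions, composition can be sketched in Python with dictionaries keyed by string pairs, using the probability semiring. The word, phoneme strings, phone strings and probabilities below are hypothetical; real systems represent these relations as automata rather than enumerating pairs.

```python
from collections import defaultdict

def compose(S, T):
    """(S o T)(r, t) = sum over s of S(r, s) * T(s, t), for finite dict-based transductions."""
    by_input = defaultdict(list)
    for (s, t), w in T.items():
        by_input[s].append((t, w))
    out = defaultdict(float)
    for (r, s), w1 in S.items():
        for t, w2 in by_input.get(s, []):
            out[(r, t)] += w1 * w2
    return dict(out)

# Hypothetical word-to-phoneme and phoneme-to-phone transductions
S = {("data", "d ey t ax"): 0.6, ("data", "d ae t ax"): 0.4}
T = {("d ey t ax", "d ey dx ax"): 0.7,
     ("d ey t ax", "d ey t ax"): 0.3,
     ("d ae t ax", "d ae dx ax"): 1.0}
word_to_phone = compose(S, T)
```

Application and reverse application are the special cases where one side is a weighted language, that is, a dictionary keyed by single strings.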
Rational Operations

Making complex languages and transductions from simple ones:

singleton       $\{u\}(v) = 1$ iff $u = v$

scaling         $(kX)(u) = k\, X(u)$

sum             $(X + Y)(u) = X(u) + Y(u)$

concatenation   $(XY)(w) = \sum_{uv = w} X(u)\, Y(v)$

power           $X^0(\epsilon) = 1$, $X^0(u \ne \epsilon) = 0$, $X^{n+1} = X X^n$

closure         $X^* = \sum_{k \ge 0} X^k$

Example (pronunciation dictionary):

transduction for a word w: probability that word w is realized as phone string p

closure of the sum over w of these transductions: context-independent probabilities for realizations of word strings as phone strings






Weighted Automata 

Weighted finite automata implement rational weighted languages and 
transductions 

• Automata transitions:

$q \xrightarrow{x/k} q'$

acceptor: $x \in \Sigma \cup \{\epsilon\}$; transducer: $x \in (\Sigma \cup \{\epsilon\}) \times (\Gamma \cup \{\epsilon\})$;
weight $k$

• Example — word pronunciation transducer: 
Automata Operations



Operations between weighted rational languages and transductions 
have corresponding automata operations 

Generalized composition is implemented by a general automata join 
operation 




[Figure: example join of weighted transducers; surviving arc labels include a:x/0.5, b:ε/0.5, b:y/1 and a:x/1]




Pruning: keep only those paths in the join within a beam of the best 
path 

Optimization: try to find a smaller weighted automaton with the same 
language/transduction 
Rule Compilation

• Theory of phonological rewrite rules and their implementation as
finite-state transducers is well understood (Kaplan & Kay, 1994)

• E.g.: (left-to-right) obligatory rule of the form

$\phi \to \psi \;/\; \lambda \,\_\_\, \rho$

can be modeled by composing a series of transducers

Prologue ∘
Id(Obligatory(φ, <_i, >)) ∘
Id(Rightcontext(ρ, <, >)) ∘
Replace ∘
Id(Leftcontext(λ, <, >)) ∘
Prologue^{-1}

each of which expresses a regular relation that restricts a certain
portion of the application of the rule.
Rule Compilation: Statistical Rules

A sample (toy) probabilistic ruleset

{All} := ptkdaeiou\&R012 ;;
{Cons} := ptkd ;;
{Vowel} := aeiou\&R ;;
{Stress} := 12 ;;

End Prolog

t -> ({DD}<0.20>, {tt}<4.32>, {dd}<4.64>,
      {??}<5.64>, {DEL}<6.64>)
     / {Stress}{Cons}*{Vowel} 0{Vowel} ;;

k -> ({??}<0.15>, {kk}<3.32>) / # ;; 

R -> ({&&}<0.15>, {RR}<0.14>) / ;; 

t -> ({tt}<4.32>, {??}<!. 8>) / # ;; 
Rule Compilation: Statistical Rules

Output for latORk given compiled ruleset from previous slide 
Chinese Word Segmentation

[Chinese sentence; characters lost in extraction]
I forget NEG-POT liberation avenue be-at where

Competing readings of overlapping character pairs: (understand), (enlarge)

"I couldn't forget where Liberation Avenue is."



Chinese Word Segmentation

"I couldn't forget where Liberation Avenue is."

[Figure: weighted segmentation lattice over wang4 bu4 liao3 jie3 fang4 da4 jie1. One path reads wang4 (forget, vb), bu4-liao3 (NEG-POT, npot), jie3-fang4 (liberation, nc), da4-jie1 (avenue, nc); the competing path uses bu4 (not, adv), liao3-jie3 (understand, vb), fang4-da4 (enlarge, vb), jie1 (street, nc). Arcs carry costs; the two path totals follow.]



0.85 + 11.38 + 10.92 + 11.45 = 34.60 

wins over 

11.77 + 4.58 + 8.11 + 10.70 + 10.36 = 45.52 
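
Choosing the cheapest path is a shortest-path computation in the (min, +) semiring. A minimal Python sketch follows, using a dictionary lexicon and dynamic programming instead of a compiled transducer; the units "a", "b", "c" and all costs are hypothetical stand-ins, since the per-arc costs of the figure cannot be assigned to words reliably.

```python
def segment(units, lexicon):
    """Return (cost, words) for the cheapest segmentation of units.

    lexicon maps tuples of units (candidate words) to costs, e.g. negative log probabilities.
    """
    n = len(units)
    inf = float("inf")
    best = [(inf, [])] * (n + 1)   # best[i] = cheapest analysis of units[:i]
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            word = tuple(units[start:end])
            if word in lexicon and best[start][0] < inf:
                cost = best[start][0] + lexicon[word]
                if cost < best[end][0]:
                    best[end] = (cost, best[start][1] + [word])
    return best[n]

# Hypothetical lexicon: costs add along a path, and the minimum-cost path wins
lexicon = {("a",): 2.0, ("b",): 2.0, ("a", "b"): 3.0, ("c",): 1.0, ("b", "c"): 5.0}
cost, words = segment(["a", "b", "c"], lexicon)
```

Here the segmentation ab | c (cost 4.0) beats a | b | c (5.0) and a | bc (7.0), just as the path with the lower total wins in the comparison above.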
Some Text-Analysis Problems in Text-to-Speech Synthesis

• Text-normalization issues 

- End-of-sentence detection 

- Word-segmentation (Chinese, Japanese, Thai) 

- Abbreviation expansion: is St. Saint or Street?

- Numeral interpretation: is 747 seven hundred and forty seven, or
seven forty seven?

• Part-of-speech assignment 

• Word pronunciation 

- Morphological analysis of ordinary words and names 

- Homograph disambiguation 

• Accent prediction 

• Prosodic phrasing prediction 
Pitch Accent Prediction

Problem: predict accent status for different classes of words 

• Function versus content word: 
JOHN GAVE it to BILL 

He GAVE it to him 

• Long noun phrases: 
CITY HALL 
TAX office 

CITY hall TAX office 

• Preposing: 

We will BEGIN to LOOK at FROG anatomy today 
TODAY we will BEGIN to LOOK at FROG anatomy 

• Information status: 

My SON WANTS me to BUY a DOG , but I'm ALLERGIC to dogs. 
Pitch Accent Prediction: Sample Variables

For each word w:

• distance of w from beginning/end of sentence 

• total words in utterance 

• distance of w in words from/to prior/next boundary 

• part-of-speech of w 

• if w is in a complex nominal, the predicted accent of w given by an
automatic noun-phrase accent predictor (Sproat, 1994)

• information status 
Pitch Accent Prediction: Sample Tree

[Figure: sample classification tree for pitch accent prediction. Split variables visible in the figure include cl, fns, sfns, pron, bevb, pbevb, wh, pnouns and fsbeg (threshold 20.5); leaves are labeled acc, deacc or cl, each with a count ratio such as 477/536 or 584/866.]
Pitch Accent Prediction: Results



• Pitch Accents Predicted from the Audix Corpus: 80% Correct 

• Pitch Accents Predicted from the ATIS Database: 81.9% Correct 

• ATIS Predictions with Boundary Information: 85.1% Correct 

• ATIS Predictions with Speaker Information: 85.1% Correct 
Pitch Accent Prediction: Hand-derived decision rules

For each item w_i labeled with part-of-speech p_i:

If w_i is a phrasal verb, deaccent;
Else if p_i is classified 'closed-cliticized', cliticize;
Else if p_i is classified 'closed-deaccented', deaccent;
Else if w_i is marked 'contrastive', 'prefixed', or 'preposed', assign it emphatic accent;
Else if w_i is part of a proper nominal
    If w_i's status is 'given', assign emphatic accent,
    else assign a simple pitch accent;
Else if w_i is in global focus but not in local focus ('given'), assign emphatic accent;
Else if w_i is classified 'closed-accented', accent;
Else if w_i is in local focus ('given'), deaccent;
Else if w_i is part of a (common) complex nominal
    If w_i is predicted to be accented in citation form, accent
    else deaccent;
Else accent w_i.
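
The cascade translates almost line for line into Python. In this sketch the `Word` record and its field names are hypothetical; in particular, 'given' is modeled as a single flag meaning "in local focus", which is one reading of the rules above.

```python
from dataclasses import dataclass

@dataclass
class Word:
    # All fields are illustrative stand-ins for upstream analyses.
    pos_class: str = "open"        # or 'closed-cliticized', 'closed-deaccented', 'closed-accented'
    phrasal_verb: bool = False
    marked: bool = False           # 'contrastive', 'prefixed', or 'preposed'
    proper_nominal: bool = False
    given: bool = False            # in local focus
    global_focus: bool = False
    complex_nominal: bool = False
    citation_accented: bool = True

def assign_accent(w):
    """Apply the rules in order; the first matching rule decides."""
    if w.phrasal_verb:
        return "deaccent"
    if w.pos_class == "closed-cliticized":
        return "cliticize"
    if w.pos_class == "closed-deaccented":
        return "deaccent"
    if w.marked:
        return "emphatic"
    if w.proper_nominal:
        return "emphatic" if w.given else "accent"
    if w.global_focus and not w.given:
        return "emphatic"
    if w.pos_class == "closed-accented":
        return "accent"
    if w.given:
        return "deaccent"
    if w.complex_nominal:
        return "accent" if w.citation_accented else "deaccent"
    return "accent"
```

Because the rules are ordered, a phrasal verb is deaccented even when it is also marked contrastive, which is the kind of rule interaction that the trained tree is meant to learn from data instead.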
Word Pronunciation



• Represent lexicon as FSA L. 

- Morphological derivatives of words in the lexicon can be represented 
using standard finite-state morphological techniques (Koskenniemi, 1983; 
Karttunen et al. 1992; Sproat 1992). 

- Corpus-derived weights can be added to arcs to rank multiple analyses. 

- Orthographic rules can be compiled into a (W)FST O that can be 
composed with the lexicon L to form a (W)FST L' that can 
morphologically decompose words as they occur in text. 

- Pronunciation rules (either hand-built or compiled from a trained decision 
tree) can be compiled into a (W)FST P that can be composed with L' to 
yield a (W)FST L" that will transduce input words to sets of 
pronunciations. 

• L'' can be composed with a WFST Φ implementing phoneme-to-phone rules
(again, either hand-developed or compiled from a trained decision tree) to
yield L''' = L'' ∘ Φ. L''' can then be inverted for use in an ASR system.
Word Pronunciation: an Example

1 Input               ОТЦОВ

2 Annotated Input     отц"ов

3 Underlying Form     от"{E1}ц{noun}{msc}{an}+"ов{pl}{gen}

4 Phonological Form   ats'of

• Map 1 to 2 by transducer that freely introduces stress marks (").

• Compose lexicon of legal underlying forms, with rules such as
{E1} → e / ___ {DelCons}{GRAM}*+{NUM}
{E1} → ∅
" → ∅ / ___ Σ* "
Invert the resulting transducer and compose this with 2 to produce 3.

• Compile pronunciation rules such as
o → o /
o → a
into transducer that can be composed with 2 to produce 4.






Phrasing Prediction 

Problem: predict intonational phrase boundaries in long
unpunctuated utterances:

For his part, Clinton told reporters in Little Rock, Ark., on Wednesday
| that the pact can be a good thing for America || if we change our
economic policy || to rebuild American industry here at home || and if
we get the kind of guarantees we need on environmental and labor
standards in Mexico || and a real plan || to help the people who will
be dislocated by it.

Previous treatments have used rule-based parsing approaches 
(O'Shaughnessy, 1989; Bachenko & Fitzpatrick, 1990). 

AT&T synthesizer uses a CART-based predictor trained on labeled 
corpora (Hirschberg & Wang 1992). 
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Phrasing Prediction: Variables 

For each <Wi, Wj>:

• length of utterance; distance of Wi in syllables/stressed
syllables/words ... from the beginning/end of the sentence

• automatically predicted pitch accent for Wi and Wj

• part-of-speech (POS) for a 4-word window around <Wi, Wj>

• (largest syntactic constituent dominating Wi but not Wj and vice
versa, and smallest constituent dominating them both)

• whether <Wi, Wj> is dominated by an NP and, if so, distance of
Wi from the beginning of that NP, length of the NP, and distance/length

• (mutual information scores for a four-word window around <Wi, Wj>)

The most successful of these predictors so far appear to be POS, some
constituency information, and mutual information.
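A few of the simpler variables can be computed directly from a POS-tagged utterance. The sketch below is illustrative only: the function name juncture_features, the feature names, the padding tag "NONE", and the sample tags are assumptions, and it covers just utterance length, word distances, and the four-word POS window.

```python
def juncture_features(tagged, i, window=4):
    """Features for the juncture between tagged[i] and tagged[i + 1].

    `tagged` is a list of (word, pos) pairs. Window positions that fall
    outside the utterance are filled with the placeholder tag "NONE".
    """
    if not 0 <= i < len(tagged) - 1:
        raise ValueError("i must index a word that has a right neighbour")
    half = window // 2
    pos_window = []
    for k in range(i - half + 1, i + half + 1):
        pos_window.append(tagged[k][1] if 0 <= k < len(tagged) else "NONE")
    return {
        "utterance_length": len(tagged),
        "words_from_start": i + 1,               # words up to and including Wi
        "words_to_end": len(tagged) - (i + 1),   # words after Wi
        "pos_window": tuple(pos_window),
    }

# Hypothetical tagged fragment of the example sentence.
example = [("if", "IN"), ("we", "PRP"), ("change", "VB"),
           ("our", "PRP$"), ("economic", "JJ"), ("policy", "NN")]
feats = juncture_features(example, 0)
```

Feature dictionaries of this kind, one per juncture, would form the training rows for a tree-based classifier.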
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Phrasing Prediction: Sample Tree 

[Figure: sample decision tree (file new.snp); not recoverable from the extracted text.]
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Phrasing Prediction: Results 

Results for multi-speaker read speech: 

- major boundaries only: 91.2% 

- collapsed major/minor phrases: 88.4% 

- 3-way distinction between major, minor and null boundary: 
81.9% 

Results for spontaneous speech: 

- major boundaries only: 88.2% 

- collapsed major/minor phrases: 84.4% 

- 3-way distinction between major, minor and null boundary: 
78.9% 

Results for 85K words of hand-annotated text, cross-validated on 
training data: 95.4%. 
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AT&T Text-to-Speech Synthesis 



[Diagram: processing pipeline from TEXT to SPEECH; legible modules, in order:]

TEXT

- Text Preprocessing
- Phrasal Accents
- Duration
- Intonation
- Amplitude
- Glottal Source
- Unit Selection
- Unit Concatenation
- Synthesis

SPEECH
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Representations in Speech Recognition 

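Quantized observations: [two alternative diagrams, joined by "Or,"; not recoverable]

Phone model A_π: [diagram with states s0, s1, s2; arcs labeled input:output/weight]

- self-loops: o_i:ε/p_00(i) on s0, o_i:ε/p_11(i) on s1, o_i:ε/p_22(i) on s2
- forward arcs: o_i:ε/p_01(i) from s0 to s1, o_i:ε/p_12(i) from s1 to s2, ε:π/p_2f leaving s2

Acoustic transducer: A = (Σ_π A_π)*

Word pronunciations D_data: [diagram; legible arcs, input:output/weight]

- d:ε/1
- ey:ε/(weight illegible), ae:ε/.6
- dx:ε/.8, t:ε/.2
- ax:"data"/1

Dictionary: D = (Σ_w D_w)*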
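The word model for "data" can be read as a small weighted transducer whose paths are the alternative pronunciations. The sketch below enumerates those paths and multiplies arc probabilities. The linear state layout and the weight 0.4 on the ey arc are assumptions: the ey weight is illegible in the source, and 0.4 is chosen only so that it and the 0.6 on ae sum to one.

```python
# Arcs: (source, target, phone, output word or None, probability).
# The 0.4 on "ey" is an assumption; the source weight is illegible.
ARCS = [
    (0, 1, "d", None, 1.0),
    (1, 2, "ey", None, 0.4),
    (1, 2, "ae", None, 0.6),
    (2, 3, "dx", None, 0.8),
    (2, 3, "t", None, 0.2),
    (3, 4, "ax", "data", 1.0),
]
START, FINAL = 0, 4

def pronunciations(state=START, phones=(), word=None, prob=1.0):
    """Yield (phone tuple, word, probability) for every path to FINAL."""
    if state == FINAL:
        yield phones, word, prob
        return
    for src, dst, phone, out, p in ARCS:
        if src == state:
            yield from pronunciations(dst, phones + (phone,),
                                      out or word, prob * p)

# Map each pronunciation to its probability given the word.
table = {phones: prob for phones, word, prob in pronunciations()}
```

Under these assumed weights the four pronunciations receive probabilities that sum to one, which is what P(p|w) requires.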
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Recognition Cascade 



[Diagram: cascade from observations (o) to phones to words.]




• Levels: 

- Observations: O(o) = 1, O(s ≠ o) = 0

- Acoustic-phone transduction: A(a, p) = P(a | p)

- Pronunciation dictionary: D(p, w) = P(p | w)

- Language model: M(w) = P(w)

• Recognition: maximize (O ∘ A ∘ D ∘ M)(w)
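Read as weighted relations, composition sums over the intermediate symbols, so the maximized quantity can be written out explicitly. This is a sketch that assumes the weights are probabilities combined by product and sum, with O equal to 1 on the observed sequence o and 0 elsewhere. A system that stores negative log weights and keeps only the best path would replace the sum with a minimum over paths.

```latex
\begin{aligned}
(O \circ A \circ D \circ M)(w)
  &= \sum_{a}\sum_{p} O(a)\, A(a,p)\, D(p,w)\, M(w) \\
  &= \sum_{p} P(o \mid p)\, P(p \mid w)\, P(w), \\
\hat{w} &= \arg\max_{w} \; P(w) \sum_{p} P(o \mid p)\, P(p \mid w).
\end{aligned}
```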
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Example: Phone Lattice O o A 

Phone lattice for hostile battle: 



[Figure: phone lattice; not recoverable from the extracted text.]
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Sample Pronunciation Dictionary D 

Dictionary with hostile, battle and bottle as a transducer: 



[Figure: dictionary transducer; legible arc labels (input:output/weight) include ax:-/2.607, ay:-/1.616, s:-/0.035 and -:hostile/2.943.]
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Sample Language Model M 



Language model as acceptor: 



[Figure: language-model acceptor; legible arc label: battle/10.896.]
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Recognition Output: O o A o D o M 

Apply dictionary to phone lattice to create word lattice, then compose 
word lattice with language model to obtain word lattice with combined 
acoustic/language weights. 
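The last step, combining a word lattice with a language model, amounts to intersecting two weighted acceptors and searching for the cheapest path. The sketch below assumes epsilon-free acceptors and additive costs (negative log probabilities). The function names and all costs are hypothetical and are not the values from the figures.

```python
import heapq

def intersect(a, b):
    """Intersect two epsilon-free weighted acceptors.

    Each acceptor is (start, finals, arcs), with arcs given as
    {state: [(label, cost, next_state), ...]}. Costs add.
    """
    start = (a[0], b[0])
    arcs, finals, stack, seen = {}, set(), [start], {start}
    while stack:
        qa, qb = state = stack.pop()
        if qa in a[1] and qb in b[1]:
            finals.add(state)
        arcs[state] = []
        for la, ca, na in a[2].get(qa, []):
            for lb, cb, nb in b[2].get(qb, []):
                if la == lb:
                    nxt = (na, nb)
                    arcs[state].append((la, ca + cb, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
    return start, finals, arcs

def best_path(acceptor):
    """Dijkstra search for the cheapest accepted label sequence
    (assumes non-negative costs)."""
    start, finals, arcs = acceptor
    heap, done, tie = [(0.0, 0, start, ())], set(), 0
    while heap:
        cost, _, state, labels = heapq.heappop(heap)
        if state in finals:
            return cost, labels
        if state in done:
            continue
        done.add(state)
        for label, c, nxt in arcs.get(state, []):
            tie += 1   # unique tie-breaker so states are never compared
            heapq.heappush(heap, (cost + c, tie, nxt, labels + (label,)))
    return None

# Hypothetical word lattice: acoustics slightly prefer "bottle".
LATTICE = (0, {2}, {0: [("hostile", 3.0, 1)],
                    1: [("battle", 4.0, 2), ("bottle", 3.5, 2)]})
# Hypothetical unigram language model: one looping state.
LM = (0, {0}, {0: [("hostile", 3.0, 0), ("battle", 2.0, 0),
                   ("bottle", 4.0, 0)]})

result = best_path(intersect(LATTICE, LM))
```

In this toy case the language model overturns the acoustic preference, which is the reason for combining the two weight sources.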







Language Identification 

• Language identification can be approached by simultaneously 
recognizing in N languages and selecting the language with the best 
recognition score. 

• In weakly-constrained task domains, the combined "lexicon" and 
"language model" for each language may need to be correspondingly 
weak (but general) - e.g., phone or syllable n-grams. 

• In more strongly-constrained task domains, more lexical and
grammatical/semantic constraints can be used, as in conventional
ASR systems, ranging from word spotting to full trigram
language models.

• Constructing such systems requires multilingual text normalization 
and pronunciation components for training and testing. 
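The weakly-constrained case can be illustrated by scoring a symbol string under one n-gram model per language and choosing the best score. This is a toy sketch under stated assumptions: letters stand in for recognized phones, the models are add-one-smoothed bigrams, and the training strings and function names are invented for illustration.

```python
import math
from collections import Counter

def train_bigram(samples, alphabet):
    """Return an add-one-smoothed bigram log-probability function."""
    pair_counts, context_counts = Counter(), Counter()
    for text in samples:
        padded = "#" + text + "#"          # '#' marks the boundaries
        for prev, cur in zip(padded, padded[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    vocab = len(alphabet) + 1              # symbols plus the boundary mark

    def logprob(text):
        padded = "#" + text + "#"
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            num = pair_counts[(prev, cur)] + 1
            den = context_counts[prev] + vocab
            total += math.log(num / den)
        return total

    return logprob

def identify(text, models):
    """Pick the language whose model gives `text` the highest score."""
    return max(models, key=lambda lang: models[lang](text))

# Hypothetical training strings; letters stand in for phone symbols.
ENGLISH = ["the", "this", "that", "then"]
GERMAN = ["der", "die", "das", "den"]
alphabet = set("".join(ENGLISH + GERMAN))

models = {
    "english": train_bigram(ENGLISH, alphabet),
    "german": train_bigram(GERMAN, alphabet),
}
guess_a = identify("thin", models)
guess_b = identify("dies", models)
```

A full system would run the recognizers in parallel and compare recognition scores; the n-gram score here plays the role of the combined lexicon and language model only.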
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